-
Notifications
You must be signed in to change notification settings - Fork 49
Implement instructions for matching builtin character classes and assertions #547
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WIP because I need to figure out these weird performance behaviours Weird thing number 1: The first commit in this branch (copying over the AsciiBitset changes from the matchScalar PR) is resulting in significant speedups
I sanity checked this by trying to minimize the change (not copying it over to a new file, not adding the new scalar api) and got the same result.
Some quick profiling seems to indicate that the performance increase came the call to |
Weird performance thing number 2: The optimization is causing regressions in regexes that do not produce the new instruction and very little improvement in the ones that do Comparing with the first commit on the branch we get larger regressions in all of the email regexes and smaller ones in almost every other result
|
Removing the remnants of the bitset fast path optimization I was testing seems to have reverted most of the regressions
But why?
Need to make 100% sure that compile time is not affecting the benchmark times. A lot of this weirdness could also just be due to inaccurate timing from the benchmarker and variance because I'm running everything locally |
@swift-ci test |
case .verticalWhitespace: | ||
matched = c.isNewline && (c.isASCII || !isStrictAscii) | ||
case .newlineSequence: | ||
matched = c.isNewline && (c.isASCII || !isStrictAscii) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was carried over from the old code, but why does this match a full \r\n
sequence when in unicode scalars mode? Should we change that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's what is specified in the Unicode proposal:
Additionally,
\R
andCharacterClass.newline
provide a way to include the CR+LF pair, even when matching with Unicode scalar semantics.
I don't know the exact rationale behind it, but I would assume it's done in an effort to be compatible with other regex engines that match scalars by default but with \R
still matching \r\n
.
return currentPosition == subjectBounds.upperBound | ||
} | ||
|
||
case .wordBoundary: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There seem to be no test cases that use the usesSimpleUnicodeBoundaries
option, should we be testing that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We definitely should be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have a couple: testSimpleWordBoundaries
and https://github.com/apple/swift-experimental-string-processing/blob/96fb215facf425226ca0645d5cab6e84a3befece/Tests/RegexBuilderTests/RegexDSLTests.swift#L540 but the latter was xfail, I have to fix the test because the test is wrong.
@swift-ci test |
switch semanticLevel { | ||
case .graphemeCluster: | ||
return input.index(after: currentPosition) == subjectBounds.upperBound | ||
&& input[currentPosition].isNewline |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CC @Azoy , these are more examples of a consume operation that we want for more efficient access to String. Something like func nom() -> (current: Character, next: Index)
return currentPosition == subjectBounds.upperBound | ||
} | ||
|
||
case .wordBoundary: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We definitely should be.
I may have gotten a little carried away while updating this branch to match main (internalization of It seems to all be working but there's a slight hiccup in inverting character classes that I'll work out on Monday |
@swift-ci test |
@swift-ci test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In progress review, but plenty to communicate. Looks great! Lots of little tweaks and clarifications but this is awesome stuff!
@@ -162,6 +162,8 @@ extension DSLTree.Atom { | |||
case .assertion: | |||
// TODO: We could handle, should this be total? | |||
return nil | |||
case .characterClass(let cc): | |||
return cc.generateConsumer(opts) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Future: get rid of all these consumers except custom ones like custom character classes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main reason why all these consumer functions + _CharacterClassModel.matches
still exist is because we have to emit any non-ascii CustomCharacterClass as one big consumer so it needs to know how to create a consumer for each possible member type (which includes Atom).
Do you think we could emit it as an ordered choice over its members instead? I think the main thing to add would be instructions for handling the custom character class set operations, and if we had that then we could rip out all of the non character property related stuff in ConsumerInterface
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to avoid emitting backtracking traffic for custom character classes if we can. However, since there's a compilation boundary around each member anyways, that might not account for much. Let's keep this for now and try to come back to it again
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another thing to consider is that if we keep non-ascii custom character classes as one big consumer we're unable to access any optimized matching for any of its members.
Also just no longer having all the matching logic exactly duplicated across ConsumerInterface
and Processor
seems like something pretty important for future extensibility/preventing bugs. We pay some cost for emitting backtracking traffic but this is mostly for uncommon cases, the most common ascii cases are already handled by the bitset.
} | ||
|
||
func isAtEndOfLine(_ payload: AssertionPayload) -> Bool { | ||
if currentPosition == subjectBounds.upperBound { return true } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... similarly non-aligned upper bounds.
switch payload.semanticLevel { | ||
case .graphemeCluster: | ||
return input.index(after: currentPosition) == subjectBounds.upperBound | ||
&& input[currentPosition].isNewline |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly, alignment?
case .endOfSubject: return currentPosition == subjectBounds.upperBound | ||
|
||
case .resetStartOfMatch: | ||
fatalError("Unreachable, we should have thrown an error during compilation") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Future: we'll want to version support so that we can reason about back deployment
- static vars in payloads - Clean up _CharacterClassModel - Use the model for bytecodegen and consumer interface - Merge the grapheme and scalar match builtin cases together
@swift-ci test |
matchBuiltin
to match builtins instead of a consumeFn calling_CharacterClassModel.matches
in most casesAssertionFunction
and changesassertBy
to match inProcessor
insteadThere are some gains (10% in microbenchmarks on builtin character classes) but the main purpose is to set up for future work
Future work:
matchBuiltin
andassertBy
will be optimized (hopefully soon) along with the rest of the instruction layout. Changes made in this branch and the scalar optimization branch showed that this is a strong next step_CharacterClassModel.matches
but I think we're getting very close to being able to rip out most of it and the consumer interface forCustomCharacterClass